John Wilshire, Jono Chan :)
17/08/2017
That was scraped from:
Games from the 2008/2009 season up to the 2015/2016 season.
We took a subset of this, the EPL:
* ~3k matches
* 1.3k EPL players
* 35 teams (teams are promoted into and relegated out of the EPL)
A season has 380 games, so the 8 seasons give 3,040 games in total.
## [,1]
## number_of_games "3040"
## min(date) "2008-08-16 00:00:00"
## max(date) "2016-05-17 00:00:00"
Distribution of scores
Soccer is very low scoring
For each team (home and away) we have the 11-player lineup. We join this with the player statistics table, taking each player's closest assessment before the game, and then average the scores across the lineup (the `_mean_home` and `_mean_away` columns below) as a measure of how strong we think each team is.
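The join-then-aggregate step described above can be sketched in base R. This is a minimal illustration on made-up data: the table names (`lineups`, `ratings`) and the columns `match_id`, `side`, `player_id`, `match_date` and `assess_date` are assumptions, not the schema of the scraped database.

```r
# Hypothetical toy tables; names and values are illustrative only.
lineups <- data.frame(
  match_id   = c(1, 1, 1, 1),
  side       = c("home", "home", "away", "away"),
  player_id  = c(10, 11, 20, 21),
  match_date = as.Date("2010-03-01")
)
ratings <- data.frame(
  player_id      = c(10, 10, 11, 20, 21, 21),
  assess_date    = as.Date(c("2009-08-01", "2010-02-01", "2010-01-15",
                             "2010-02-20", "2009-12-01", "2010-06-01")),
  overall_rating = c(70, 74, 80, 65, 60, 90)
)

# Pair every lineup slot with all of that player's assessments.
joined <- merge(lineups, ratings, by = "player_id")
# Keep only assessments made before the match, so no future information leaks in.
joined <- joined[joined$assess_date < joined$match_date, ]
# Sort by date, then keep the most recent remaining assessment per match and player.
joined <- joined[order(joined$match_id, joined$player_id, joined$assess_date), ]
latest <- joined[!duplicated(joined[c("match_id", "player_id")], fromLast = TRUE), ]

# Average within each team to get one rating per match and side.
team_rating <- aggregate(overall_rating ~ match_id + side, data = latest, FUN = mean)
print(team_rating)
```

Filtering on `assess_date < match_date` before picking the latest row is what keeps the feature honest: player 21's later assessment of 90 is ignored because it was made after the match.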
Model on everything
glm(home_win ~ . , family = binomial(), data = epl4 %>% select(-outcome, -matches('goal|outcome'))) -> home_full_glm
summary(home_full_glm)
##
## Call:
## glm(formula = home_win ~ ., family = binomial(), data = epl4 %>%
## select(-outcome, -matches("goal|outcome")))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1272 -0.9972 -0.5865 1.0674 2.4146
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.836286 2.579394 -0.324 0.745773
## cumulative_margin_home 0.014607 0.003996 3.655 0.000257 ***
## cumulative_margin_away -0.012342 0.003999 -3.086 0.002026 **
## overall_rating_mean_home -0.024619 0.056367 -0.437 0.662278
## potential_mean_home 0.042015 0.040867 1.028 0.303907
## crossing_mean_home 0.001653 0.017336 0.095 0.924050
## finishing_mean_home -0.008298 0.018770 -0.442 0.658431
## heading_accuracy_mean_home 0.003069 0.020429 0.150 0.880589
## short_passing_mean_home 0.007477 0.031274 0.239 0.811043
## volleys_mean_home 0.031482 0.015981 1.970 0.048850 *
## dribbling_mean_home 0.003786 0.022912 0.165 0.868762
## curve_mean_home -0.008841 0.016549 -0.534 0.593189
## free_kick_accuracy_mean_home 0.035521 0.014564 2.439 0.014727 *
## long_passing_mean_home 0.011767 0.025291 0.465 0.641737
## ball_control_mean_home -0.021327 0.039116 -0.545 0.585604
## acceleration_mean_home -0.009251 0.029511 -0.313 0.753910
## sprint_speed_mean_home 0.001683 0.028344 0.059 0.952647
## agility_mean_home -0.006141 0.020909 -0.294 0.768998
## reactions_mean_home 0.049184 0.029570 1.663 0.096253 .
## balance_mean_home 0.003522 0.015782 0.223 0.823418
## shot_power_mean_home -0.015660 0.019497 -0.803 0.421881
## jumping_mean_home 0.030277 0.016578 1.826 0.067793 .
## stamina_mean_home -0.005951 0.019266 -0.309 0.757396
## strength_mean_home 0.017964 0.021563 0.833 0.404801
## long_shots_mean_home -0.023499 0.020291 -1.158 0.246821
## aggression_mean_home -0.007986 0.015100 -0.529 0.596878
## interceptions_mean_home 0.004172 0.018856 0.221 0.824893
## positioning_mean_home -0.005696 0.019206 -0.297 0.766789
## vision_mean_home 0.022139 0.019647 1.127 0.259820
## penalties_mean_home 0.015169 0.014267 1.063 0.287681
## marking_mean_home -0.015385 0.023646 -0.651 0.515275
## standing_tackle_mean_home 0.011226 0.029581 0.380 0.704307
## sliding_tackle_mean_home -0.037010 0.023222 -1.594 0.110990
## gk_diving_mean_home 0.050944 0.039854 1.278 0.201156
## gk_handling_mean_home 0.076229 0.046216 1.649 0.099067 .
## gk_kicking_mean_home 0.009853 0.016814 0.586 0.557900
## gk_positioning_mean_home -0.034341 0.046920 -0.732 0.464222
## gk_reflexes_mean_home -0.084819 0.046099 -1.840 0.065778 .
## overall_rating_mean_away -0.029729 0.056201 -0.529 0.596817
## potential_mean_away -0.042012 0.039831 -1.055 0.291533
## crossing_mean_away -0.009620 0.016818 -0.572 0.567305
## finishing_mean_away 0.001267 0.018399 0.069 0.945119
## heading_accuracy_mean_away -0.014740 0.019712 -0.748 0.454609
## short_passing_mean_away -0.028182 0.030992 -0.909 0.363179
## volleys_mean_away 0.014514 0.015491 0.937 0.348805
## dribbling_mean_away 0.039078 0.022972 1.701 0.088920 .
## curve_mean_away 0.013912 0.016340 0.851 0.394541
## free_kick_accuracy_mean_away -0.011804 0.014059 -0.840 0.401138
## long_passing_mean_away -0.004881 0.024790 -0.197 0.843920
## ball_control_mean_away -0.045529 0.039363 -1.157 0.247423
## acceleration_mean_away 0.035373 0.029210 1.211 0.225888
## sprint_speed_mean_away -0.059708 0.028065 -2.128 0.033377 *
## agility_mean_away 0.020644 0.020371 1.013 0.310853
## reactions_mean_away 0.013291 0.028801 0.461 0.644447
## balance_mean_away -0.021616 0.015693 -1.377 0.168387
## shot_power_mean_away -0.007514 0.019397 -0.387 0.698481
## jumping_mean_away 0.019873 0.016817 1.182 0.237324
## stamina_mean_away -0.003418 0.019262 -0.177 0.859174
## strength_mean_away 0.041473 0.021206 1.956 0.050501 .
## long_shots_mean_away -0.008512 0.020198 -0.421 0.673461
## aggression_mean_away 0.006474 0.014975 0.432 0.665536
## interceptions_mean_away -0.020670 0.018659 -1.108 0.267959
## positioning_mean_away 0.003029 0.018353 0.165 0.868900
## vision_mean_away 0.009772 0.019392 0.504 0.614330
## penalties_mean_away -0.010008 0.013918 -0.719 0.472074
## marking_mean_away 0.008284 0.023418 0.354 0.723517
## standing_tackle_mean_away 0.018267 0.029391 0.622 0.534264
## sliding_tackle_mean_away -0.016159 0.023148 -0.698 0.485139
## gk_diving_mean_away -0.001992 0.038098 -0.052 0.958308
## gk_handling_mean_away -0.035461 0.045440 -0.780 0.435156
## gk_kicking_mean_away 0.015038 0.016713 0.900 0.368237
## gk_positioning_mean_away 0.001995 0.046934 0.043 0.966097
## gk_reflexes_mean_away -0.005658 0.045912 -0.123 0.901917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4192.1 on 3039 degrees of freedom
## Residual deviance: 3695.4 on 2967 degrees of freedom
## AIC: 3841.4
##
## Number of Fisher Scoring iterations: 4
AIC(home_full_glm)
## [1] 3841.443
cat('full model, 72 predictors')
## full model, 72 predictors
caret::confusionMatrix(table(predict(home_full_glm, epl4) > 0, full$home_win))
## Confusion Matrix and Statistics
##
##
## FALSE TRUE
## FALSE 1214 586
## TRUE 436 804
##
## Accuracy : 0.6638
## 95% CI : (0.6467, 0.6806)
## No Information Rate : 0.5428
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3169
## Mcnemar's Test P-Value : 3.15e-06
##
## Sensitivity : 0.7358
## Specificity : 0.5784
## Pos Pred Value : 0.6744
## Neg Pred Value : 0.6484
## Prevalence : 0.5428
## Detection Rate : 0.3993
## Detection Prevalence : 0.5921
## Balanced Accuracy : 0.6571
##
## 'Positive' Class : FALSE
##
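`predict()` on a `glm` returns the linear predictor (log-odds) by default, so the `> 0` cutoff used for the confusion matrix above is the same rule as a 0.5 probability threshold. The sketch below checks this on simulated data (the `toy` data frame and its columns are made up for illustration) and then recomputes two headline metrics from the counts printed above.

```r
set.seed(1)
# Simulated stand-in data; `x` and `y` are hypothetical, not columns of epl4.
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(0.8 * toy$x)) == 1
fit <- glm(y ~ x, family = binomial(), data = toy)

log_odds  <- predict(fit, toy)                     # default type = "link"
prob      <- predict(fit, toy, type = "response")  # fitted probabilities
same_rule <- all((log_odds > 0) == (prob > 0.5))   # both cutoffs flag the same rows

# Counts copied from the full-model confusion matrix (rows = predicted, columns = actual).
cm <- matrix(c(1214, 436, 586, 804), nrow = 2,
             dimnames = list(predicted = c("FALSE", "TRUE"),
                             actual    = c("FALSE", "TRUE")))
accuracy        <- sum(diag(cm)) / sum(cm)                 # share of matches called correctly
home_win_recall <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])  # share of actual home wins caught
print(c(same_rule = same_rule, accuracy = accuracy, home_win_recall = home_win_recall))
```

Because caret treats the first level (`FALSE`) as the positive class here, the reported Sensitivity of 0.7358 is the rate for matches that were not home wins; recall for actual home wins is the reported Specificity, 0.5784. `confusionMatrix()` accepts a `positive` argument if home wins should be the reference class.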
Reduced model
glm(home_win ~ . , family = binomial(), data = full %>% select(-outcome, -matches('goal'))) -> home_glm
AIC(home_glm)
## [1] 3782.579
summary(home_glm)
##
## Call:
## glm(formula = home_win ~ ., family = binomial(), data = full %>%
## select(-outcome, -matches("goal")))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1203 -1.0188 -0.6208 1.0809 2.2515
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.623131 1.591838 -1.020 0.308
## cumulative_margin_home 0.016625 0.003808 4.366 1.27e-05 ***
## cumulative_margin_away -0.015635 0.003807 -4.107 4.02e-05 ***
## overall_rating_mean_home 0.113161 0.014920 7.585 3.34e-14 ***
## overall_rating_mean_away -0.094396 0.014637 -6.449 1.13e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4192.1 on 3039 degrees of freedom
## Residual deviance: 3772.6 on 3035 degrees of freedom
## AIC: 3782.6
##
## Number of Fisher Scoring iterations: 4
predict(home_glm, full) -> home_preds
cat('reduced model, 4 predictors')
## reduced model, 4 predictors
caret::confusionMatrix(table(home_preds > 0, full$home_win))
## Confusion Matrix and Statistics
##
##
## FALSE TRUE
## FALSE 1210 611
## TRUE 440 779
##
## Accuracy : 0.6543
## 95% CI : (0.6371, 0.6712)
## No Information Rate : 0.5428
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2966
## Mcnemar's Test P-Value : 1.573e-07
##
## Sensitivity : 0.7333
## Specificity : 0.5604
## Pos Pred Value : 0.6645
## Neg Pred Value : 0.6390
## Prevalence : 0.5428
## Detection Rate : 0.3980
## Detection Prevalence : 0.5990
## Balanced Accuracy : 0.6469
##
## 'Positive' Class : FALSE
##
# Summary plot
par(mfrow = c(2,2))
plot(home_glm)
Multinomial model of the three-way outcome (A = away win, D = draw, H = home win)
## # weights: 18 (10 variable)
## initial value 3339.781358
## iter 10 value 2987.333352
## final value 2986.500537
## converged
## Confusion Matrix and Statistics
##
##
## mn_preds A D H
## A 438 237 255
## D 0 2 0
## H 429 544 1135
##
## Overall Statistics
##
## Accuracy : 0.5181
## 95% CI : (0.5002, 0.536)
## No Information Rate : 0.4572
## P-Value [Acc > NIR] : 1.025e-11
##
## Kappa : 0.1908
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: D Class: H
## Sensitivity 0.5052 0.0025543 0.8165
## Specificity 0.7736 1.0000000 0.4103
## Pos Pred Value 0.4710 1.0000000 0.5384
## Neg Pred Value 0.7967 0.7429230 0.7264
## Prevalence 0.2852 0.2575658 0.4572
## Detection Rate 0.1441 0.0006579 0.3734
## Detection Prevalence 0.3059 0.0006579 0.6934
## Balanced Accuracy 0.6394 0.5012771 0.6134
Poisson models of goals scored, fitted for both the home team and the away team
##
## Call:
## glm(formula = home_team_goal ~ ., family = poisson(), data = full %>%
## select(-away_team_goal, -outcome, -home_win))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5051 -0.8850 -0.1596 0.5343 3.7227
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.051697 0.610270 -1.723 0.084829 .
## cumulative_margin_home 0.004682 0.001302 3.595 0.000325 ***
## cumulative_margin_away -0.006371 0.001398 -4.558 5.17e-06 ***
## overall_rating_mean_home 0.045638 0.005495 8.306 < 2e-16 ***
## overall_rating_mean_away -0.026601 0.005475 -4.859 1.18e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3789.7 on 3039 degrees of freedom
## Residual deviance: 3396.9 on 3035 degrees of freedom
## AIC: 9275.9
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = away_team_goal ~ ., family = poisson(), data = full %>%
## select(-home_team_goal, -outcome, -home_win))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4469 -1.2986 -0.1242 0.5701 3.1087
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.019710 0.703846 -1.449 0.14740
## cumulative_margin_home -0.004480 0.001622 -2.763 0.00573 **
## cumulative_margin_away 0.001785 0.001518 1.176 0.23959
## overall_rating_mean_home -0.039281 0.006433 -6.106 1.02e-09 ***
## overall_rating_mean_away 0.054050 0.006260 8.635 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3886.2 on 3039 degrees of freedom
## Residual deviance: 3571.6 on 3035 degrees of freedom
## AIC: 8367.6
##
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
##
##
## FALSE TRUE
## FALSE 684 255
## TRUE 966 1135
##
## Accuracy : 0.5984
## 95% CI : (0.5807, 0.6158)
## No Information Rate : 0.5428
## P-Value [Acc > NIR] : 3.635e-10
##
## Kappa : 0.2221
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4145
## Specificity : 0.8165
## Pos Pred Value : 0.7284
## Neg Pred Value : 0.5402
## Prevalence : 0.5428
## Detection Rate : 0.2250
## Detection Prevalence : 0.3089
## Balanced Accuracy : 0.6155
##
## 'Positive' Class : FALSE
##
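The report does not show how the two Poisson fits were turned into the win/no-win predictions above. One common approach, sketched here under the assumption that home and away goals are independent Poisson counts, sums the joint score probabilities over every scoreline where the home side scores more. The function name `outcome_probs` and the rates used are illustrative, not taken from the fitted models.

```r
# Probability of each result given expected goals, assuming the two scores are
# independent Poisson counts (an assumption; the source does not state its rule).
outcome_probs <- function(lambda_home, lambda_away, max_goals = 15) {
  goals <- 0:max_goals
  # score_grid[i, j] = P(home scores goals[i]) * P(away scores goals[j])
  score_grid <- outer(dpois(goals, lambda_home), dpois(goals, lambda_away))
  c(home = sum(score_grid[lower.tri(score_grid)]),  # home goals > away goals
    draw = sum(diag(score_grid)),                   # equal scores
    away = sum(score_grid[upper.tri(score_grid)]))  # away goals > home goals
}

# Illustrative rates only, not fitted values from the models above.
p <- outcome_probs(lambda_home = 1.5, lambda_away = 1.1)
print(round(p, 3))
```

Unlike simply comparing the two fitted means, this gives draws their own probability mass, which matters in a low-scoring sport where roughly a quarter of the matches in this dataset end level.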
Our data ended up being very high dimensional. We could explore methods of reducing the dimensionality of our dataset, such as PCA.
Tournament models: Elo ranking, dynamic Bradley-Terry.
Choose a sport that has more developed statistics, e.g. baseball or basketball.
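As a pointer for the Elo ranking idea listed above, here is a minimal sketch of the standard Elo update for a single match. The K-factor of 20 and the optional `home_adv` offset are illustrative assumptions, not values tuned on this data.

```r
# Expected score of the home side under the Elo logistic curve.
elo_expected <- function(r_home, r_away, home_adv = 0) {
  1 / (1 + 10^((r_away - (r_home + home_adv)) / 400))
}

# One rating update; `score` is 1 for a home win, 0.5 for a draw, 0 for a home loss.
elo_update <- function(r_home, r_away, score, k = 20, home_adv = 0) {
  delta <- k * (score - elo_expected(r_home, r_away, home_adv))
  c(home = r_home + delta, away = r_away - delta)
}

# Two equally rated teams, home side wins: 10 points change hands with k = 20.
new_ratings <- elo_update(1500, 1500, score = 1)
print(new_ratings)
```

Running the update over matches in date order would yield a pre-match rating for each team, which could replace or complement the `cumulative_margin_*` features.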
Data manipulation:
* dplyr
* knitr
* tidyr
* lubridate
Graphics:
* ggplot2
* plotly (interactive)
* corrplot